In this section, we compare cpprb with other replay buffer implementations.
Important Notice
Except for cpprb and DeepMind/Reverb, the replay buffers reviewed here are only one part of a larger reinforcement learning ecosystem. These libraries focus on reinforcement learning as a whole rather than on providing the best possible replay buffer.
Our motivation is to provide strong replay buffers to researchers and developers who not only use existing networks and/or algorithms but also create brand-new networks and/or algorithms.
Here, we would like to show that cpprb is sufficiently functional and efficient compared with the others.
OpenAI Baselines is a set of baseline implementations of reinforcement learning algorithms developed by OpenAI.
The source code is published under the MIT license.
Ordinary and prioritized experience replay are implemented with the ReplayBuffer and PrioritizedReplayBuffer classes, respectively. Using these classes directly is (probably) not the intended usage, but you can import them like this:
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
ReplayBuffer is initialized with a size parameter for the replay buffer capacity. Additionally, PrioritizedReplayBuffer requires an alpha parameter for the degree of prioritization. These parameters don't have default values, so you need to specify them.
buffer_size = int(1e6)
alpha = 0.6
rb = ReplayBuffer(buffer_size)
prb = PrioritizedReplayBuffer(buffer_size,alpha)
A transition is stored into the replay buffer by calling ReplayBuffer.add(self, obs_t, action, reward, obs_tp1, done).
For PrioritizedReplayBuffer, the maximum priority seen so far is automatically assigned to a newly added transition.
These replay buffers are ring buffers, so the oldest transition is overwritten by a new one after the buffer becomes full.
obs_t = [0, 0, 0]
action = [1]
reward = 0.5
obs_tp1 = [1, 1, 1]
done = 0.0
rb.add(obs_t,action,reward,obs_tp1,done)
prb.add(obs_t,action,reward,obs_tp1,done) # Store with max. priority
Stored transitions can be sampled by calling ReplayBuffer.sample(self, batch_size) or PrioritizedReplayBuffer.sample(self, batch_size, beta).
ReplayBuffer returns a tuple of batch-size transitions. PrioritizedReplayBuffer additionally returns the importance-sampling weights and the indexes of the sampled transitions.
batch_size = 32
beta = 0.4
obs_batch, act_batch, rew_batch, next_obs_batch, done_mask = rb.sample(batch_size)
obs_batch, act_batch, rew_batch, next_obs_batch, done_mask, weights, idxes = prb.sample(batch_size,beta)
Priorities can be updated by calling PrioritizedReplayBuffer.update_priorities(self, idxes, priorities).
priorities = [0.5] * batch_size  # e.g. new priorities computed from TD errors
prb.update_priorities(idxes,priorities)
Internally, these replay buffers utilize a Python list for storage, so the memory usage gradually increases until the buffer becomes full.
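As a rough illustration (not the actual Baselines code), the list-backed ring buffer behaves like this: the list grows while the buffer is filling, and once it reaches capacity the oldest entry is overwritten in place.
capacity = 3
storage = []   # grows until it holds `capacity` items
next_idx = 0
for transition in ["t0", "t1", "t2", "t3", "t4"]:
    if next_idx >= len(storage):
        storage.append(transition)      # memory usage grows while filling
    else:
        storage[next_idx] = transition  # overwrite the oldest item once full
    next_idx = (next_idx + 1) % capacity
print(storage)  # ['t3', 't4', 't2']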
RLlib is a reinforcement learning library built on top of the distributed framework Ray.
The source code is published under the Apache-2.0 license.
Ordinary and prioritized experience replay are implemented with the ReplayBuffer and PrioritizedReplayBuffer classes, respectively.
These classes are decorated with @DeveloperAPI and are intended to be used by developers when making custom algorithms.
from ray.rllib.execution.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
These replay buffer classes are initialized like those of OpenAI Baselines:
buffer_size = int(1e6)
alpha = 0.6
rb = ReplayBuffer(buffer_size)
prb = PrioritizedReplayBuffer(buffer_size,alpha)
A transition is stored by calling ReplayBuffer.add(self, item, weight) or PrioritizedReplayBuffer.add(self, item, weight). The item is an instance of ray.rllib.policy.sample_batch.SampleBatch. (The API was changed at ray-0.8.7.)
The SampleBatch class can hold any kind of values with a batch dimension; however, only a single transition is allowed per addition. (If you pass a SampleBatch holding multiple transitions to the add() method, it will probably produce an unintended bug.) The keys of a SampleBatch can be freely defined by the user.
In RLlib, PrioritizedReplayBuffer takes the weight parameter to specify the priority at insertion time. In order to unify the API, ReplayBuffer also requires the weight parameter, even though it is not used at all. Moreover, weight does not have a default value (such as None), so you need to pass something explicitly.
from ray.rllib.policy.sample_batch import SampleBatch

obs_t = [0, 0, 0]
action = [1]
reward = 0.5
obs_tp1 = [1, 1, 1]
done = 0.0
weight = 0.5
# For addition, SampleBatch must be initialized with a set of `list`s having only a single element.
rb.add(SampleBatch(obs=[obs_t],
                   action=[action],
                   reward=[reward],
                   new_obs=[obs_tp1],
                   done=[done]),
       None)
prb.add(SampleBatch(obs=[obs_t],
                    action=[action],
                    reward=[reward],
                    new_obs=[obs_tp1],
                    done=[done]),
        weight)
Like OpenAI Baselines, stored transitions can be sampled by calling ReplayBuffer.sample(self, batch_size) or PrioritizedReplayBuffer.sample(self, batch_size, beta).
ReplayBuffer returns a SampleBatch of batch-size transitions. (The API was changed at ray-0.8.7.) PrioritizedReplayBuffer also includes weights and batch_indexes in the returned SampleBatch.
batch_size = 32
beta = 0.4
sample = rb.sample(batch_size)
obs_batch, act_batch, rew_batch, next_obs_batch, done_mask = sample["obs"], sample["action"], sample["reward"], sample["new_obs"], sample["done"]
sample = prb.sample(batch_size,beta)
obs_batch, act_batch, rew_batch, next_obs_batch, done_mask, weights, idxes = sample["obs"], sample["action"], sample["reward"], sample["new_obs"], sample["done"], sample["weights"], sample["batch_indexes"]
Priorities can also be updated by calling PrioritizedReplayBuffer.update_priorities(self, idxes, priorities).
priorities = [0.5] * batch_size  # e.g. new priorities computed from TD errors
prb.update_priorities(idxes,priorities)
Internally, these replay buffers utilize a Python list for storage, so the memory usage gradually increases until the buffer becomes full.
ChainerRL is a deep reinforcement learning library based on the framework Chainer. Chainer (including ChainerRL) has already stopped active development, and the development team (Preferred Networks) has joined PyTorch development.
The source code is published under the MIT license.
Ordinary and prioritized experience replay are implemented with the ReplayBuffer and PrioritizedReplayBuffer classes, respectively.
from chainerrl.replay_buffers import ReplayBuffer, PrioritizedReplayBuffer
ChainerRL has a slightly different API from OpenAI Baselines.
ReplayBuffer is initialized with a capacity=None parameter for the buffer size and a num_steps=1 parameter for the N-step configuration.
PrioritizedReplayBuffer can additionally take the parameters alpha=0.6, beta0=0.4, betasteps=2e5, eps=0.01, normalize_by_max=True, error_min=0, and error_max=1.
In ChainerRL, the beta parameter (the exponent for importance-sampling weight correction) starts from beta0, automatically increases with an equal step size during betasteps iterations, and stays at 1.0 afterwards.
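The schedule is linear; here is a minimal sketch (our own illustration, not ChainerRL's code) using the beta0 and betasteps parameters listed above.
def beta_schedule(step, beta0=0.4, betasteps=2e5):
    # Anneal beta linearly from beta0 to 1.0 over `betasteps` iterations.
    return min(1.0, beta0 + (1.0 - beta0) * step / betasteps)

beta_schedule(0)      # 0.4
beta_schedule(1e5)    # 0.7
beta_schedule(3e5)    # 1.0 (clipped after betasteps)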
buffer_size = int(1e6)
alpha = 0.6
rb = ReplayBuffer(buffer_size)
prb = PrioritizedReplayBuffer(buffer_size,alpha)
A transition is stored by calling ReplayBuffer.append(self, state, action, reward, next_state=None, next_action=None, is_state_terminal=False, env_id=0, **kwargs). Additional keyword arguments are stored too, so you can use any custom environment values. By specifying env_id, multiple trajectories can be tracked under the N-step configuration.
obs_t = [0, 0, 0]
action = [1]
reward = 0.5
obs_tp1 = [1, 1, 1]
done = False
rb.append(obs_t,action,reward,obs_tp1,is_state_terminal=done)
prb.append(obs_t,action,reward,obs_tp1,is_state_terminal=done)
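Because append accepts arbitrary keyword arguments and an env_id, custom environment values can be stored as well; a small sketch (the keyword my_custom_value is our own example, not part of ChainerRL):
# Store an extra custom value and tag the transition with a second environment.
rb.append(obs_t, action, reward, obs_tp1,
          is_state_terminal=done, env_id=1, my_custom_value=1.0)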
Stored transitions are sampled by calling ReplayBuffer.sample(self, num_experience) or PrioritizedReplayBuffer.sample(self, n).
Unlike other implementations, ChainerRL's replay buffers return unique (non-duplicated) transitions, so batch_size must be smaller than the number of stored transitions. Furthermore, they return a Python list of dicts of transitions instead of a Python tuple of environment values.
batch_size = 32
# The returned list of dicts needs additional processing to take the transitions apart
transitions = rb.sample(batch_size)
transitions_with_weight = prb.sample(batch_size)
The update indexes cannot be specified manually; instead, PrioritizedReplayBuffer memorizes the indexes it sampled. (Without calling sample first, the user cannot update priorities.)
prb.update_priorities(priorities)
Internally, these replay buffers utilize Python lists for storage, so the memory usage gradually increases until the buffer becomes full. In ChainerRL, the storage is not a single Python list but two lists, which allows popping the oldest element in O(1) time.
Reverb is relatively new; it was released by DeepMind on 26th May 2020.
Reverb is a framework for experience replay like cpprb. By utilizing a server-client model, Reverb is mainly optimized for large-scale distributed reinforcement learning.
The source code is published under the Apache-2.0 license.
Currently (28th June 2020), Reverb officially states that it is not yet production level and requires a development version of TensorFlow (i.e. tf-nightly 2.3.0.dev20200604).
Ordinary and prioritized experience replay are constructed by passing reverb.selectors.Uniform() or reverb.selectors.Prioritized(alpha), respectively, as the sampler argument of the reverb.Table constructor.
The following sample code constructs a server with two replay buffers listening on port 8000.
import reverb
buffer_size = int(1e6)
alpha = 0.6
server = reverb.Server(tables=[reverb.Table(name="ReplayBuffer",
                                            sampler=reverb.selectors.Uniform(),
                                            remover=reverb.selectors.Fifo(),
                                            rate_limiter=reverb.rate_limiters.MinSize(1),
                                            max_size=buffer_size),
                               reverb.Table(name="PrioritizedReplayBuffer",
                                            sampler=reverb.selectors.Prioritized(alpha),
                                            remover=reverb.selectors.Fifo(),
                                            rate_limiter=reverb.rate_limiters.MinSize(1),
                                            max_size=buffer_size)],
                       port=8000)
By changing the sampler and remover arguments, we can use different algorithms for sampling and overwriting, respectively.
The algorithms implemented in reverb.selectors are the following:
- Uniform: Select uniformly.
- Prioritized: Select proportionally to the stored priorities.
- Fifo: Select the oldest data.
- Lifo: Select the newest data.
- MinHeap: Select the data with the lowest priority.
- MaxHeap: Select the data with the highest priority.
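For example, a table that samples proportionally to priority but evicts the lowest-priority item (rather than the oldest one) when full could be configured as follows; the table name PrioritizedKeepBest is our own choice for this sketch.
keep_best_table = reverb.Table(name="PrioritizedKeepBest",
                               sampler=reverb.selectors.Prioritized(alpha),
                               remover=reverb.selectors.MinHeap(),
                               rate_limiter=reverb.rate_limiters.MinSize(1),
                               max_size=buffer_size)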
There are three ways to store a transition.
The first method uses reverb.Client.insert. Not only the prioritized replay buffer but also the ordinary replay buffer requires a priority, even though it is not used.
import reverb
client = reverb.Client(f"localhost:{server.port}")
obs_t = [0, 0, 0]
action = [1]
reward = [0.5]
obs_tp1 = [1, 1, 1]
done = [0]
client.insert([obs_t,action,reward,obs_tp1,done],priorities={"ReplayBuffer":1.0})
client.insert([obs_t,action,reward,obs_tp1,done],priorities={"PrioritizedReplayBuffer":1.0})
The second method uses reverb.Client.writer, which is also used internally by reverb.Client.insert. This method can be more efficient, because you can flush multiple items together by calling reverb.Writer.close (or by leaving the with block) instead of flushing them one by one.
import reverb
client = reverb.Client(f"localhost:{server.port}")
obs_t = [0, 0, 0]
action = [1]
reward = [0.5]
obs_tp1 = [1, 1, 1]
done = [0]
with client.writer(max_sequence_length=1) as writer:
    writer.append([obs_t,action,reward,obs_tp1,done])
    writer.create_item(table="ReplayBuffer",num_timesteps=1,priority=1.0)
    writer.append([obs_t,action,reward,obs_tp1,done])
    writer.create_item(table="PrioritizedReplayBuffer",num_timesteps=1,priority=1.0)
The last method uses reverb.TFClient.insert. This class is designed to be used inside a TensorFlow graph.
import reverb
import tensorflow as tf
tf_client = reverb.TFClient(f"localhost:{server.port}")
obs_t = tf.constant([0, 0, 0])
action = tf.constant([1])
reward = tf.constant([0.5])
obs_tp1 = tf.constant([1, 1, 1])
done = tf.constant([0])
tf_client.insert([obs_t,action,reward,obs_tp1,done],
                 tables=tf.constant(["ReplayBuffer"]),
                 priorities=tf.constant([1.0],dtype=tf.float64))
tf_client.insert([obs_t,action,reward,obs_tp1,done],
                 tables=tf.constant(["PrioritizedReplayBuffer"]),
                 priorities=tf.constant([1.0],dtype=tf.float64))
The tables parameter must be a tf.Tensor of str with rank 1, and the priorities parameter must be a tf.Tensor of float64 with rank 1. The lengths of tables and priorities must match.
Transitions can also be sampled in three ways.
The first method uses reverb.Client.sample, which returns a generator of reverb.replay_sample.ReplaySample. As far as we could investigate, the beta parameter is not supported, and importance-sampling weights are not calculated for prioritized experience replay.
batch_size = 32
transitions = client.sample("ReplayBuffer",num_samples=batch_size)
transitions_with_priority = client.sample("PrioritizedReplayBuffer",num_samples=batch_size)
The second method uses reverb.TFClient.sample, which does not support batch sampling.
transition = tf_client.sample("ReplayBuffer",
                              [tf.float64,tf.float64,tf.float64,tf.float64,tf.float64])
transition_priority = tf_client.sample("PrioritizedReplayBuffer",
                                       [tf.float64,tf.float64,tf.float64,tf.float64,tf.float64])
The last method is completely different from the others: it calls reverb.TFClient.dataset, which returns a reverb.ReplayDataset derived from tf.data.Dataset.
Once the ReplayDataset is created, it can be used as a generator and automatically fetches transitions from the replay buffer with proper timing.
dataset = tf_client.dataset("ReplayBuffer",
                            [tf.float64,tf.float64,tf.float64,tf.float64,tf.float64],
                            [4,1,1,4,1])
dataset_priority = tf_client.dataset("PrioritizedReplayBuffer",
                                     [tf.float64,tf.float64,tf.float64,tf.float64,tf.float64],
                                     [4,1,1,4,1])
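As a sketch of the generator-style usage (assuming the dtypes and shapes given above, and that each element is a ReplaySample whose info carries the key and whose data holds the transition):
for sample in dataset.take(1):
    key = sample.info.key                        # needed later for priority updates
    obs, act, rew, next_obs, done = sample.data  # one transition per element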
Priorities can be updated by reverb.Client.mutate_priorities or reverb.TFClient.update_priorities. Unlike the other implementations, the key is not an integer index but a hash, so the key must be taken from the sampled items by accessing ReplaySample.info.key.
for t in transitions_with_priority:
    client.mutate_priorities("PrioritizedReplayBuffer",updates={t.info.key: 0.5})

tf_client.update_priorities("PrioritizedReplayBuffer",
                            transition_priority.info.key,
                            priorities=tf.constant([0.5],dtype=tf.float64))
There are also some other replay buffer implementations, which we could not review thoroughly. In the future, we would like to investigate them and compare them with cpprb.